Adding On-demand Training Data Notebook #162


Closed

Conversation

KennSmithDS:

PR to merge the notebook tutorial for creating on-demand training data from the Planetary Computer data catalog when starting from a Radiant MLHub dataset.

KennSmithDS changed the title from "Adding On-demand Training Data Notebooke" to "Adding On-demand Training Data Notebook" on May 3, 2022

TomAugspurger left a comment:

Thanks, I left a few comments inline. Still making my way through the example.

One general comment: I'm pretty uncomfortable having moderately complex code in this example. I'd much prefer that things like temporal_buffer, mind_cloud_cover_scene, and even get_landsat_8_match be generalized and put into a dedicated library, where they can be properly unit tested. With the code here in a notebook, it's not easy to test and not easy to reuse.

"source": [
"Once you have your API key, you will need to create a default profile by setting up a .mlhub/profiles file in your home directory. You can use the `mlhub configure` command line tool to do this:\n",
"\n",
"`$ mlhub configure`<br>\n",

TomAugspurger:

!mlhub configure --api-key={MLHUB_API_KEY}

as a regular code cell should work.
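
For reference, a minimal sketch of what that regular code cell could look like, assuming the API key is kept in an environment variable (the variable name is illustrative, not something the notebook defines):

import os

# Assumption: the MLHub API key has been exported as an environment variable.
MLHUB_API_KEY = os.environ["MLHUB_API_KEY"]

# Then, in a regular notebook cell (IPython shell escape):
# !mlhub configure --api-key={MLHUB_API_KEY}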

KennSmithDS (Author):

Done

"id": "2ba7d7b2-7fed-4412-8c9b-448baad6e595",
"metadata": {},
"source": [
"This helper function below encapsulates the process of querying a STAC API endpoint to fetch an ItemCollection matching query criteria."

TomAugspurger:

IMO this helper function isn't adding much value over just using catalog.search directly. I'd rather teach users how to use catalog.search.

Can you remove the uses of search_stac_api and use client_catalog.search instead?
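
For example, a hedged sketch of searching the Planetary Computer STAC API directly with pystac-client; the collection id, bounds, date range, and cloud-cover threshold are illustrative rather than taken from the notebook:

from pystac_client import Client

# Open the public Planetary Computer STAC API.
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Illustrative chip bounds and date range for a single label.
bbox = [13.30, 52.45, 13.35, 52.50]
date_range = "2018-05-01/2018-08-31"

search = catalog.search(
    collections=["landsat-8-c2-l2"],        # illustrative collection id
    bbox=bbox,
    datetime=date_range,
    query={"eo:cloud_cover": {"lt": 20}},   # keep mostly clear scenes
)

# item_collection() on recent pystac-client; older releases call it get_all_items().
items = search.item_collection()
print(f"Found {len(items)} matching Items")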

KennSmithDS (Author):

Done

"id": "2e131160-50bb-487f-8fcb-96d37ce80167",
"metadata": {},
"source": [
"We could certainly use the method above to query label Items directly from our connection to the Radiant MLHub API endpoint. However, on very large collections, such as in the case with BigEarthNet, pagination becomes a bottleneck issue in obtaining and resolving STAC items, as it only returns 100 items at a time. Querying the entire Collection of nearly ~600,000 Items could take hours.\n",

TomAugspurger:

FWIW, the limit argument in pystac-client controls the size of the pages. But agreed that fetching 600,000 items through an API isn't what we should recommend.
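
A small sketch of those knobs, assuming catalog is an open pystac_client.Client as above and the collection id is illustrative: limit sets the page size per request, and max_items caps the total number fetched.

search = catalog.search(
    collections=["bigearthnet-v1-labels"],  # illustrative collection id
    limit=500,       # page size per API request
    max_items=2000,  # stop paging after this many Items in total
)
label_items = search.item_collection()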

"if not os.path.exists(label_collection_path):\n",
" collection = Collection.fetch(BIGEARTHNET_LABEL_COLLECTION)\n",
" archive_path = collection.download(TMP_DIR)\n",
" !tar -xf {archive_path.as_posix()} -C {TMP_DIR}\n",

TomAugspurger:

It's unfortunate that this decompression takes so long :/ Any thoughts on if you can operate directly on the compressed .gz file? Probably not.

KennSmithDS (Author):

I agree, it is unfortunate how long the decompression takes.

I didn't have a great workaround for this; it's related to your previous comment about using limit and taking a random sample of Items from the larger dataset of ~600,000 Items.

I suppose if we fetch only a few thousand to begin with, pagination shouldn't be an issue and we wouldn't need to deal with the .tar.gz file. However, that likely won't be a random sample; I'm guessing the API will just return the first XXX Items?
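
One hedged option for working on the compressed archive directly, rather than extracting it, is to stream members with Python's tarfile module and keep a reservoir-style random sample; a sketch, where the stac.json filename pattern and the sample size are assumptions and archive_path is the path returned by collection.download above:

import json
import random
import tarfile

# Stream label Item JSON straight out of the .tar.gz and keep a random
# sample of k Items without extracting the archive to disk.
sample, k = [], 1000
with tarfile.open(archive_path, mode="r:gz") as tar:
    members = (m for m in tar if m.name.endswith("stac.json"))
    for i, member in enumerate(members):
        fileobj = tar.extractfile(member)
        if fileobj is None:
            continue
        item_dict = json.loads(fileobj.read())
        if len(sample) < k:
            sample.append(item_dict)                 # fill the reservoir
        elif random.random() < k / (i + 1):
            sample[random.randrange(k)] = item_dict  # replace a slot at random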

"id": "e419d96f-f8b9-4911-b5d1-4a4077767062",
"metadata": {},
"source": [
"If we had the source collection archive downloaded and uncompressed in the same parent directory as the labels collection, we could reference the source Items and images directly. However the BigEarthNet source collection is over 60GB when compressed. Therefore to work around the disk size limitations of a Planetary Computer instance, we can query the same source items from the MLHub API endpoint, the same way we got the labels, but filter to the exact source item using IDs."

TomAugspurger:

What's your preferred workflow as a user, and from MLHub's point of view? Do you want people making local copies of this dataset, or do you want them fetching from your storage on demand?

IMO, the ideal workflow is on-demand fetching from blob storage in the same region + caching, but I'm curious what you think.

KennSmithDS (Author):

In my short time here at Radiant, I've seen a divergence between the workflow we'd prefer our users to follow and the workflow most of our data users/consumers actually follow.

Personally, I agree that the ideal workflow is on-demand fetching of assets from blob storage, traversing a catalog/STAC API on a VM or notebook server in the same region. However, my assumption is that a majority of our users download the datasets directly to their local computers so they can do machine learning or other geospatial analytics on them, or aren't familiar enough with STAC/PySTAC and STAC APIs to fetch the data that way.

Also, any on-demand hosted notebook server environment comes with the issue of limited persistent volume sizes, especially if folks are accustomed to downloading instead of fetching and caching without writing to disk.

},
"outputs": [],
"source": [
"if best_l8_match:\n",

TomAugspurger:

Same question about potentially not having a match.

KennSmithDS (Author):

See the `if source_items` comment thread above.

},
"outputs": [],
"source": [
"explore_search_extent(ItemCollection([best_l8_match]))"

TomAugspurger:

Can you also plot the s2 chip's bounds here?

KennSmithDS (Author):

Similar to other comment threads, would it be best to strip this out of the helper function, and just have a few cells of repeated code for exploring the API search results?

KennSmithDS (Author):

Clarifying question on this: do you mean the overall bounds of the ItemCollection returned (e.g. minX, minY, maxX, maxY)? Each Item returned from the API search will have its own bounding box.
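
If the overall extent is what's wanted, one hedged way to get it is to union each Item's footprint with shapely; item_collection here is a placeholder for the search result:

from shapely.geometry import shape
from shapely.ops import unary_union

# Union of every Item footprint, then the overall (minx, miny, maxx, maxy)
# bounds for plotting alongside the chip's own bounding box.
footprint = unary_union([shape(item.geometry) for item in item_collection])
overall_bounds = footprint.bounds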

"metadata": {},
"outputs": [],
"source": [
"client = dd_client(\n",

TomAugspurger:

Just use dask.distributed.Client() (or a GatewayCluster & get_client if doing this on a distributed cluster)

KennSmithDS (Author):

For both ps_client and dd_client, I simplified by removing the import aliases, so users will directly use distributed.Client and pystac_client.Client.
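
A minimal sketch of that simplified setup; Client() with no arguments starts a LocalCluster, which matches the LocalCluster-only approach discussed further down:

import distributed
import pystac_client

# Local Dask cluster for the notebook; no gateway configuration required.
client = distributed.Client()

# Planetary Computer STAC API, opened with pystac_client.Client directly.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)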

"outputs": [],
"source": [
"# this cell will only work on PC or a machine with gateway cluster configured\n",
"# gateway = dask_gateway.Gateway()\n",

TomAugspurger:

IMO: pick whether or not you want this example to run on a cluster. If it runs without a cluster, then we can just remove this.

KennSmithDS (Author):

Removed

" Landsat 8 DataArray that has been cropped to label bbox\n",
" \"\"\"\n",
" # read label Item object\n",
" label_item = Item.from_file(\n",

TomAugspurger:

This, and the other functions that touch the local filesystem, won't work with a distributed cluster. They'd only work with a LocalCluster.

If we're mentioning possibly using a distributed cluster, then you'd need to restructure this. Most likely, you'd need to store everything in Azure Blob Storage and use it as a kind of shared file system that each worker can read from and write to.

KennSmithDS (Author):

For the purposes of this tutorial, it seems to make sense to focus on the workflow of gathering and processing the Landsat 8 data from the Planetary Computer with a LocalCluster, rather than complicating it by running the workflow on a distributed cluster.

KennSmithDS (Author):

@TomAugspurger do you know why the stackstac.stack function adds a buffer to the ndarrays returned? Does it have to do with how it reprojects the image data when it's cached?

For example, in this block of code I'm fetching the Sentinel-2 source imagery from the Azure Blob Storage for our MLHub. We know the chips are all 120x120 pixels, but the stack object dimensions vary from 122x122 up to 130x130.

s2_stack = stack(
    items=ItemCollection([source_item]),
    assets=BIGEARTHNET_RGB_BANDS,
    epsg=rio.open(get_redirect_url(source_item.assets["B02"])).crs.to_epsg(),
    resolution=10,
)

P.S. sorry I don't know how you're doing the cool Jupyter Notebook integration.
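
For what it's worth, a hedged sketch of pinning the output extent explicitly; whether this actually removes the padding would need confirming, and bounds/snap_bounds are standard stackstac.stack parameters rather than anything specific to this notebook:

import rasterio as rio
from pystac import ItemCollection
from stackstac import stack

# Read the chip's native bounds and CRS from one asset (get_redirect_url,
# source_item, and BIGEARTHNET_RGB_BANDS come from the notebook).
with rio.open(get_redirect_url(source_item.assets["B02"])) as src:
    chip_bounds = tuple(src.bounds)
    chip_epsg = src.crs.to_epsg()

s2_stack = stack(
    items=ItemCollection([source_item]),
    assets=BIGEARTHNET_RGB_BANDS,
    epsg=chip_epsg,
    resolution=10,
    bounds=chip_bounds,   # clip to exactly the chip's extent
    snap_bounds=False,    # don't snap bounds outward to the resolution grid
)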

KennSmithDS (Author):

Closing in favor of #171 due to a rebasing issue.
